Background, Overview, and Motivation:

There is no community that feels the convergence of government policies and in-home culture more than women and children. We are interested in studying the population of mothers and infants due to their unique vulnerability, and their health outcomes which are disproportionately affected by various exposures and government policies. Our group members had interests in both environmental health and perinatal health, so we decided to assess the effect of environmental exposures on maternal and infant health outcomes. We choose to restrict our analysis of these exposures and outcomes to only California due to both data availability constraints, the state’s population and racial diversity, its comprehensive and transparent environmental regulations, and its high use of pesticides. California is the greatest user of pesticides in the US with over 85 million kg applied annually, an amount equivalent to roughly 30% of the cumulative active ingredients applied to US agriculture.

In 1986, California passed “The Safe Drinking Water and Toxic Enforcement Act of 1986” also known as Proposition 65. Proposition 65 requires businesses to provide warnings regarding significant exposures to chemicals that cause cancer, birth defects or other reproductive harm. By requiring that this information be provided, Proposition 65 enables Californians to make informed decisions about their exposures to these chemicals. California has a list of harmful chemicals as characterized by Proposition 65, which is updated at least once a year and includes over 900 chemicals. Proposition 65 has motivated businesses to eliminate or reduce toxic chemicals in numerous consumer products and has led to the safer reformulation of many products. The law has also been successful in educating the general public about exposures to toxic chemicals in consumer products, buildings, and the environment, which as a result created a demand and market reward for less-toxic products.

California is the most populous state in the United States (with roughly 12% of annual births) and is a very diverse state in regards to both demographics and landscape across its various counties, which lends to increased diversity in the state level data. Due to California’s diverse population, we were able to assess exposure to racism (recorded as race) as a potential confounder to maternal health outcomes. The inclusion of analysis and a discussion surrounding race and racism is critical when doing any research in the field of maternal health. The literature has no shortage of evidence pointing to disproportionately adverse health outcomes for Black mothers and babies in America. Infant mortality rates for America’s Black babies are more than twice the rate of white babies and they are more than three times as likely to die from complications related to low birth weight. U.S. maternal mortality rates for Black women are also three to four times higher than rates for white women. For this reason we decided to classify “the experience of racism” as a confounder in our analysis and to stratify by race to account for this confounding.

Initial Questions & Research Question Evolution:

Our initial question was pretty broad: “What is the effect of pesticide use on Maternal and Child Health?” We then narrowed down our scope to include only data from California (inspired by our background research and related work). Over the course of the project, we defined our exposure as pesticide use (continuous variable measured in pounds). We also narrowed down our outcomes of interest to include: fertility, birth weight, and gestational age. We further narrowed our scope of counties to focus on those with high pesticide use and high agricultural activities due to the focus on these areas in the literature we reviewed (mentioned in the “related work” section). The counties we focused on were ranked in our data as the top 4 counties for highest pesticide use and they included: Fresno, Kern, Tulare, and San Joaquin (in that order). We chose to include Los Angeles as a comparison group for the exploratory analysis of the maternal and child health data because it is one of the most populated and most diverse counties in California, and this was of importance to us due to our interest in examining the maternal and infant health outcomes, stratified by race.

Data Sources:

*See notes regarding scraping, cleaning, and wrangling methods at each respective code chunk

Data Source for Pesticides:

Pesticide use for California counties data was retrieved from the California Department of Pesticide Regulation- Pesticide use reporting program https://www.cdpr.ca.gov/docs/pur/purmain.htm

Maternal and Child Health Data Source(s):

Maternal and Child Health outcome data was mainly obtained from the following three sources:

1, California Open Data Portal https://data.ca.gov/dataset/live-births-with-low-and-very-low-birthweight

  1. CHHS Open Data https://data.chhs.ca.gov/dataset/preterm-and-very-preterm-live-births/resource/cff79e2d-6ecf-4158-9e4f-7078632220ee

  2. Centers for Disease Control and Prevention (CDC) Natality Online Database on the Wide-ranging OnLine Data for Epidemiologic Research (WONDER) system Natality, 2007-2019 Request Centers for Disease Control and Prevention (CDC) Natality Online Database on the Wide-ranging OnLine Data for Epidemiologic Research (WONDER) system https://wonder.cdc.gov/natality-current.html

Exploratory Analysis:

Final Analysis:

Load Packages

library(dplyr)
library(tidyverse)
library(pdftools)
library(readr)
library(stringr)
library(ggthemes)
library(shiny)
library(shinyBS)
library(RColorBrewer)
library(shinydashboard)
library(sp)
library(rgeos)
library(rgdal)
library(maptools)
library(leaflet)
library(scales)
library(maps)
library(gridExtra)
library(grid)

MCH Data Wrangling *Low BW Only (for map)

We wanted to create a map to visualize the pattern of low birth weight in California. I initially used CDC WONDER database but it had very limited information, where it grouped rural and small county-level data to “unidentified counties.” Thus, I consulted another data source from California Open Data Portal. Following is the data cleaning process:

#Data Wrangling for Year 2014-2018 Data for Map.  
lbwdata<-read.csv("./low-and-very-low-birthweight-by-county-2014-2018 (1).csv", header = TRUE, stringsAsFactors = FALSE)
lbwdata <- lbwdata %>% mutate(County = str_to_title(County))
lbwdata$Events[is.na(lbwdata$Events)] <- 0
lbwdata <- lbwdata %>% group_by(Year, County, Total.Births) %>% summarize(Events = sum(Events)) 
## `summarise()` regrouping output by 'Year', 'County' (override with `.groups` argument)
lbwdata <- lbwdata %>% filter(!County == "california") 
lbwdata <- lbwdata %>% mutate(Rate = Events/Total.Births)

MCH Data Wrangling *Preterm Birth Only (for map)

We also wanted to create another map to visualize the pattern of preterm birth in California. Again, I ran into a similar problem using CDC WONDER database. Thus, I consulted the CHHS database Following is the data cleaning process:

ptbirthdata<- read.csv("preterm-and-very-preterm-births-by-county-2010-2018-3.csv", header = TRUE, stringsAsFactors = FALSE)
ptbirthdata$Events[is.na(ptbirthdata$Events)] <- 0
ptbirthdata <- ptbirthdata[,-c(7,8)]
ptbirthdata <- ptbirthdata %>% group_by(Year, County, Total.Births) %>% summarize(Events = sum(Events)) 
## `summarise()` regrouping output by 'Year', 'County' (override with `.groups` argument)
ptbirthdata <- ptbirthdata %>% filter(!County == "california")#removing the total count
ptbirthdata <- ptbirthdata %>% mutate(rate_pt = Events/Total.Births * 100)

Code for Creating the Map - Leaflet Map Using LBW Data (See Shiny App for the Final Result)

After cleaning the data set, I then looked at creating a “spatial” map. Following is the wrangling process of generating a map that shapes the map of California, and merging that spatial object with low birth weight data that I wrangled earlier to generate a leaflet map. My main motivation of using a leaflet map was because I wanted to create a map where the user can see which county is which and is able to zoom in and out. Note that there are counties that had NA cases (perhaps for counties that had a very small population).

map <- readOGR(path.expand("cb_2018_us_county_20m.shp"),
               layer = "cb_2018_us_county_20m", stringsAsFactors = FALSE)
## OGR data source with driver: ESRI Shapefile 
## Source: "C:\Users\zay-z\Documents\Harvard Chan\Fall 2020\BST260\datascience-project\Data Prep (& Final RMD)\cb_2018_us_county_20m.shp", layer: "cb_2018_us_county_20m"
## with 3220 features
## It has 9 fields
## Integer64 fields read as strings:  ALAND AWATER
Statekey<-read.csv('./STATEFPtoSTATENAME_Key.csv', colClasses=c('character'))
map<-merge(x=map, y=Statekey, by="STATEFP", all=TRUE)
SingleState <- subset(map, map$STATENAME %in% c(
    "California"
))

lbwdata_2016 <- lbwdata %>% filter(Year == "2016") %>% mutate(Rate = Events/Total.Births*100)
spatial_lbw <-sp::merge(x=SingleState, y=lbwdata_2016, by.x="NAME", by.y="County", by=x)

bins <- c(4.0,6.3,7.6,8.1, Inf)
pal <- colorBin(
    palette = "viridis",
    domain = spatial_lbw$Rate, n=7, bins=bins)

leaflet(spatial_lbw, options = leafletOptions(zoomControl = TRUE, zoomLevelFixed = FALSE, dragging=TRUE, minZoom = 5.3, maxZoom = 9)) %>% 
            setView(lat = 36.778259, lng = -119.417931, zoom = 6) %>%
            addPolygons(color = "Black", weight = 1, smoothFactor = 0.5, 
                        opacity = 1.0, fillOpacity = 0.5, layerId = ~NAME,
                        fillColor = ~pal(Rate), 
                        popup = ~as.factor(paste0("<b><font size=\"4\"><center>County: </b>",spatial_lbw$NAME,"</font></center>","<b>% of Low Birth Weight Births: </b>", sprintf("%1.2f%%", spatial_lbw$Rate),"<br/>"))) %>%
            addLegend(pal = pal, values = spatial_lbw$Rate, opacity = 1, title="% Low Birth Weight (Quartiles)")
## Warning in pal(Rate): Some values were outside the color scale and will be
## treated as NA

Code for Creating the Map - Leaflet Map Using Preterm Birth Data Wrangled Earlier (See Shiny App for the Final Result)

This is a similar spatial map but for preterm birth. Following is the wrangling process of generating a map that shapes the map of California, and merging that spatial object with pre-term birth data that I wrangled earlier to generate a leaflet map. Note that there are counties that had NA cases (perhaps for counties that had a very small population).

map <- readOGR(path.expand("cb_2018_us_county_20m.shp"),
               layer = "cb_2018_us_county_20m", stringsAsFactors = FALSE)
## OGR data source with driver: ESRI Shapefile 
## Source: "C:\Users\zay-z\Documents\Harvard Chan\Fall 2020\BST260\datascience-project\Data Prep (& Final RMD)\cb_2018_us_county_20m.shp", layer: "cb_2018_us_county_20m"
## with 3220 features
## It has 9 fields
## Integer64 fields read as strings:  ALAND AWATER
Statekey<-read.csv('./STATEFPtoSTATENAME_Key.csv', colClasses=c('character'))
map<-merge(x=map, y=Statekey, by="STATEFP", all=TRUE)
SingleState <- subset(map, map$STATENAME %in% c(
    "California"
))

ptbirthdata_2016 <- ptbirthdata %>% filter(Year == "2016") 
spatial_pt <-sp::merge(x=SingleState, y=ptbirthdata_2016, by.x="NAME", by.y="County", by=x)

bin <- c(5.5, 8.2, 9.1, 9.9, Inf)
pal2 <- colorBin(
    palette = "plasma",
    domain = spatial_pt$rate_pt, n=7, bins=bin)

leaflet(spatial_pt, options = leafletOptions(zoomControl = TRUE, zoomLevelFixed = FALSE, dragging=TRUE, minZoom = 5.3, maxZoom = 9)) %>% 
                    setView(lat = 36.778259, lng = -119.417931, zoom = 6) %>%
                    addPolygons(color = "Black", weight = 1, smoothFactor = 0.5, 
                                opacity = 1.0, fillOpacity = 0.5, layerId = ~NAME,
                                fillColor = ~pal2(rate_pt), 
                                popup = ~as.factor(paste0("<b><font size=\"4\"><center>County: </b>",spatial_pt$NAME,"</font></center>","<b>% of Preterm Birth: </b>", sprintf("%1.2f%%", spatial_pt$rate_pt),"<br/>"))) %>% addLegend(pal = pal2, values = spatial_pt$rate_pt, opacity = 1, title="% Preterm Birth (Quartiles)")
## Warning in pal2(rate_pt): Some values were outside the color scale and will be
## treated as NA

Maternal and Child Health Indicators Data Wrangling of CDC Data

The data sets used for the exploratory analysis of MCH indicators by county in California were downloaded from the CDC Wonder Database in “.txt” format. I read the .txt files into the rmd file and turned them into data frames. The first data frame, MCH.CDC.Data had Maternal and Infant Health Outcomes by county over the years, the MCH.CDC.Data_Race had the same variables as the MCH.CDC.Data frame but was stratified by Mother’s Race. I also renamed all the counties in these two data frames to match the same names (re: case and format) as the counties in the pesticide data frames for easier comparison of the variables in these two data frames when comparing by county. Below is the data wrangling and cleaning code for the Maternal and Child Health Data from the CDC Wonder Source.

#Data Wrangling for CDC Data (COMPLETE)
MCH.CDC.Data <- read.delim("NatalityTOTAL.txt",  sep ="\t", dec=".", header = TRUE, stringsAsFactors = FALSE)
MCH.CDC.Data <- MCH.CDC.Data[-c(491:585), ]
MCH.CDC.Data <- MCH.CDC.Data %>% filter(Notes != "Total")
MCH.CDC.Data <- MCH.CDC.Data[ ,-c(1,3,5,7,9)]

MCH.CDC.Data_Race <- read.delim("NatalityRACE.txt",  sep ="\t", dec=".", header = TRUE, stringsAsFactors = FALSE)
MCH.CDC.Data_Race <- MCH.CDC.Data_Race[-c(1794:1931), ]
MCH.CDC.Data_Race <- MCH.CDC.Data_Race[ ,-c(1,3,5,7,9,11)]

#Rename Counties to Match Pesticide Data
MCH.CDC.Data[MCH.CDC.Data$County == "Alameda County, CA", "County"] <-"Alameda"
MCH.CDC.Data[MCH.CDC.Data$County == "Butte County, CA", "County"] <-"Butte"
MCH.CDC.Data[MCH.CDC.Data$County == "Contra Costa County, CA", "County"] <-"Contra Costa"
MCH.CDC.Data[MCH.CDC.Data$County == "El Dorado County, CA", "County"] <-"El Dorado"
MCH.CDC.Data[MCH.CDC.Data$County == "Fresno County, CA", "County"] <-"Fresno"
MCH.CDC.Data[MCH.CDC.Data$County == "Humboldt County, CA", "County"] <-"Humboldt"
MCH.CDC.Data[MCH.CDC.Data$County == "Imperial County, CA", "County"] <-"Imperial"
MCH.CDC.Data[MCH.CDC.Data$County == "Kern County, CA", "County"] <-"Kern"
MCH.CDC.Data[MCH.CDC.Data$County == "Kings County, CA", "County"] <-"Kings"
MCH.CDC.Data[MCH.CDC.Data$County == "Los Angeles County, CA", "County"] <-"Los Angeles"
MCH.CDC.Data[MCH.CDC.Data$County == "Madera County, CA", "County"] <-"Madera"
MCH.CDC.Data[MCH.CDC.Data$County == "Marin County, CA", "County"] <-"Marin"
MCH.CDC.Data[MCH.CDC.Data$County == "Contra Costa County, CA", "County"] <-"Mariposa"
MCH.CDC.Data[MCH.CDC.Data$County == "Merced County, CA", "County"] <-"Merced"
MCH.CDC.Data[MCH.CDC.Data$County == "Monterey County, CA", "County"] <-"Monterey"
MCH.CDC.Data[MCH.CDC.Data$County == "Napa County, CA", "County"] <-"Napa"
MCH.CDC.Data[MCH.CDC.Data$County == "Orange County, CA", "County"] <-"Orange"
MCH.CDC.Data[MCH.CDC.Data$County == "Placer County, CA", "County"] <-"Placer"
MCH.CDC.Data[MCH.CDC.Data$County == "Riverside County, CA", "County"] <-"Riverside"
MCH.CDC.Data[MCH.CDC.Data$County == "Sacramento County, CA", "County"] <-"Sacramento"
MCH.CDC.Data[MCH.CDC.Data$County == "San Bernardino County, CA", "County"] <-"San Bernardino"
MCH.CDC.Data[MCH.CDC.Data$County == "San Diego County, CA", "County"] <-"San Diego"
MCH.CDC.Data[MCH.CDC.Data$County == "San Francisco County, CA", "County"] <-"San Francisco"
MCH.CDC.Data[MCH.CDC.Data$County == "San Joaquin County, CA", "County"] <-"San Joaquin"
MCH.CDC.Data[MCH.CDC.Data$County == "San Luis Obispo County, CA", "County"] <-"San Luis Obispo"
MCH.CDC.Data[MCH.CDC.Data$County == "San Mateo County, CA", "County"] <-"San Mateo"
MCH.CDC.Data[MCH.CDC.Data$County == "Santa Barbara County, CA", "County"] <-"Santa Barbara"
MCH.CDC.Data[MCH.CDC.Data$County == "Santa Clara County, CA", "County"] <-"Canta Clara"
MCH.CDC.Data[MCH.CDC.Data$County == "Santa Cruz County, CA", "County"] <-"Santa Cruz"
MCH.CDC.Data[MCH.CDC.Data$County == "Shasta County, CA", "County"] <-"Shasta"
MCH.CDC.Data[MCH.CDC.Data$County == "Solano County, CA", "County"] <-"Solano"
MCH.CDC.Data[MCH.CDC.Data$County == "Sonoma County, CA", "County"] <-"Sonoma"
MCH.CDC.Data[MCH.CDC.Data$County == "Stanislaus County, CA", "County"] <-"Stanislaus"
MCH.CDC.Data[MCH.CDC.Data$County == "Tulare County, CA", "County"] <-"Tulare"
MCH.CDC.Data[MCH.CDC.Data$County == "Ventura County, CA", "County"] <-"Ventura"
MCH.CDC.Data[MCH.CDC.Data$County == "Yolo County, CA", "County"] <-"Yolo"

MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Alameda County, CA", "County"] <-"Alameda"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Butte County, CA", "County"] <-"Butte"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Contra Costa County, CA", "County"] <-"Contra Costa"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "El Dorado County, CA", "County"] <-"El Dorado"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Fresno County, CA", "County"] <-"Fresno"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Humboldt County, CA", "County"] <-"Humboldt"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Imperial County, CA", "County"] <-"Imperial"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Kern County, CA", "County"] <-"Kern"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Kings County, CA", "County"] <-"Kings"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Los Angeles County, CA", "County"] <-"Los Angeles"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Madera County, CA", "County"] <-"Madera"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Marin County, CA", "County"] <-"Marin"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Contra Costa County, CA", "County"] <-"Mariposa"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Merced County, CA", "County"] <-"Merced"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Monterey County, CA", "County"] <-"Monterey"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Napa County, CA", "County"] <-"Napa"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Orange County, CA", "County"] <-"Orange"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Placer County, CA", "County"] <-"Placer"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Riverside County, CA", "County"] <-"Riverside"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Sacramento County, CA", "County"] <-"Sacramento"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "San Bernardino County, CA", "County"] <-"San Bernardino"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "San Diego County, CA", "County"] <-"San Diego"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "San Francisco County, CA", "County"] <-"San Francisco"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "San Joaquin County, CA", "County"] <-"San Joaquin"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "San Luis Obispo County, CA", "County"] <-"San Luis Obispo"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "San Mateo County, CA", "County"] <-"San Mateo"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Santa Barbara County, CA", "County"] <-"Santa Barbara"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Santa Clara County, CA", "County"] <-"Santa Clara"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Santa Cruz County, CA", "County"] <-"Santa Cruz"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Shasta County, CA", "County"] <-"Shasta"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Solano County, CA", "County"] <-"Solano"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Sonoma County, CA", "County"] <-"Sonoma"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Stanislaus County, CA", "County"] <-"Stanislaus"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Tulare County, CA", "County"] <-"Tulare"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Ventura County, CA", "County"] <-"Ventura"
MCH.CDC.Data_Race[MCH.CDC.Data_Race$County == "Yolo County, CA", "County"] <-"Yolo"

MCH.CDC.Data_Race<- MCH.CDC.Data_Race %>% rename("Mothers.Race" = "Mother.s.Bridged.Race")

Data Key:

Exploratory Data Analysis of Maternal and Infant Health Indicators from the CDC Wonder Database:

Fertility Rate

The first variable I examined was fertility rate and I first visualized the fertility rates across all counties over the years in a tile plot. I then viewed the trend in our counties of interest which include Fresno, Kern, Tulare, San Joaquin (since they are continuously ranked as the top 4 in highest use of pesticides), and Los Angeles (as a control/comparison county) since LA is one of the most populated and most diverse counties in California. Fresno, Kern, Tulare, and San Joaquin counties are also all a part of San Joaquin Valley which we mentioned in our background and related work section to be of special interest to us because it is California’s most productive agricultural region and has one of the highest amounts of pesticide use.

#Fertility Rate GGPLOTS
MCH.CDC.Data %>% 
    ggplot(aes(x = Year, y = County,  fill = Fertility.Rate)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0,0)) +
    scale_fill_gradientn("Fertility Rate", limits = c(30,100),
                         colors = brewer.pal(9, "Reds")) +
    theme_minimal() +  
    theme(panel.grid = element_blank()) +
    ggtitle("Fertility Rate by County") + 
    ylab("") + xlab("")

#Fresno, Ranked #1 in Pesticide Use
MCH.CDC.Data %>% group_by(County) %>% filter(County == "Fresno") %>% ggplot(aes(Year, Fertility.Rate, color = County)) + geom_line() + labs( y="Fertility Rate", title = "Fertility Rate in Fresno County", subtitle = "2007-2019", color = "County", caption = "Data Source: CDC WONDER Online Database") + ylim(30,100)

#Kern, Ranked #2 in Pesticide Use
MCH.CDC.Data %>% group_by(County) %>% filter(County == "Kern") %>% ggplot(aes(Year, Fertility.Rate, color = County)) + geom_line() + labs( y="Fertility Rate", title = "Fertility Rate in Kern County", subtitle = "2007-2019", color = "County", caption = "Data Source: CDC WONDER Online Database") + ylim(30,100)

#Tulare, Ranked #3 in Pesticide Use
MCH.CDC.Data %>% group_by(County) %>% filter(County == "Tulare") %>% ggplot(aes(Year, Fertility.Rate, color = County)) + geom_line() + labs( y="Fertility Rate", title = "Fertility Rate in Tulare County", subtitle = "2007-2019", color = "County", caption = "Data Source: CDC WONDER Online Database") + ylim(30,100)

#San Joaquin, Ranked #4 in Pesticide Use
MCH.CDC.Data %>% group_by(County) %>% filter(County == "San Joaquin") %>% ggplot(aes(Year, Fertility.Rate, color = County)) + geom_line() + labs( y="Fertility Rate", title = "Fertility Rate in San Joaquin County", subtitle = "2007-2019", color = "County", caption = "Data Source: CDC WONDER Online Database") + ylim(30,100)

#Los Angeles, Comparison Group 
MCH.CDC.Data %>% group_by(County) %>% filter(County == "Los Angeles") %>% ggplot(aes(Year, Fertility.Rate, color = County)) + geom_line() + labs( y="Fertility Rate", title = "Fertility Rate in Los Angeles County", subtitle = "2007-2019", color = "County", caption = "Data Source: CDC WONDER Online Database") + ylim(30,100)

#Grid Plots
p1 <- MCH.CDC.Data %>% group_by(County) %>% filter(County == "Fresno") %>% ggplot(aes(Year, Fertility.Rate, color = County)) + geom_line() + labs( y="Fertility Rate", title = "Fertility Rate in Fresno County", subtitle = "2007-2019", color = "County") + theme(legend.position = "none") + ylim(30,100)

p2 <- MCH.CDC.Data %>% group_by(County) %>% filter(County == "Kern") %>% ggplot(aes(Year, Fertility.Rate, color = County)) + geom_line() + labs( y="Fertility Rate", title = "Fertility Rate in Kern County", subtitle = "2007-2019", color = "County") + theme(legend.position = "none") + ylim(30,100)

p3 <-MCH.CDC.Data %>% group_by(County) %>% filter(County == "Tulare") %>% ggplot(aes(Year, Fertility.Rate, color = County)) + geom_line() + labs( y="Fertility Rate", title = "Fertility Rate in Tulare County", subtitle = "2007-2019", color = "County") + theme(legend.position = "none") + ylim(30,100)

p4 <-MCH.CDC.Data %>% group_by(County) %>% filter(County == "Los Angeles") %>% ggplot(aes(Year, Fertility.Rate, color = County)) + geom_line() + labs( y="Fertility Rate", title = "Fertility Rate in Los Angeles County", subtitle = "2007-2019", color = "County") + theme(legend.position = "none") + ylim(30,100)

grid.arrange(p1, p2, p3, p4, bottom = "Data Source: CDC WONDER Online Database")

Racial Demographics

The next variable I examined was the racial demographics and I did so by stratifying the MCH.CDC.Data by race (via the MCH.CDC.Data_Race data frame) and viewing the total population of each race across all counties in the tile plots to assess if there was any one county with a more dense population of a certain race. I then viewed the racial demographic trends in our counties of interest which include Fresno, Kern, Tulare, San Joaquin (since they are continuously ranked as the top 4 in highest use of pesticides), and Los Angeles (as a control/comparison county) since LA is one of the most populated and most diverse counties in California. Fresno, Kern, Tulare, and San Joaquin counties are also all a part of San Joaquin Valley which we mentioned in our background and related work section to be of special interest to us because it is California’s most productive agricultural region and has one of the highest amounts of pesticide use.

#Racial Demographic GGPLOTS
MCH.CDC.Data_Race %>% filter(Mothers.Race == "American Indian or Alaska Native") %>%
    ggplot(aes(x = Year, y = County,  fill = Total.Population)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0,0)) +
    scale_fill_gradientn("American Indian or Alaska Native Population",
                         colors = brewer.pal(9, "Reds")) +
    theme_minimal() +  
    theme(panel.grid = element_blank()) +
    ggtitle("Racial Demographics by County") + 
    ylab("") + xlab("") 

MCH.CDC.Data_Race %>% filter(Mothers.Race == "Asian or Pacific Islander") %>%
    ggplot(aes(x = Year, y = County,  fill = Total.Population)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0,0)) +
    scale_fill_gradientn("Asian or Pacific Islander", 
                         colors = brewer.pal(9, "Reds")) +
    theme_minimal() +  
    theme(panel.grid = element_blank()) +
    ggtitle("Racial Demographics by County") + 
    ylab("") + xlab("")

MCH.CDC.Data_Race %>% filter(Mothers.Race == "Black or African American") %>%
    ggplot(aes(x = Year, y = County,  fill = Total.Population)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0,0)) +
    scale_fill_gradientn("Black or African American Population",
                         colors = brewer.pal(9, "Reds")) +
    theme_minimal() +  
    theme(panel.grid = element_blank()) +
    ggtitle("Racial Demographics by County") + 
    ylab("") + xlab("")

MCH.CDC.Data_Race %>% filter(Mothers.Race == "White") %>%
    ggplot(aes(x = Year, y = County,  fill = Total.Population)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0,0)) +
    scale_fill_gradientn("White Population",
                         colors = brewer.pal(9, "Reds")) +
    theme_minimal() +  
    theme(panel.grid = element_blank()) +
    ggtitle("Racial Demographics by County") + 
    ylab("") + xlab("")

#Fresno, Ranked #1 in Pesticide Use
MCH.CDC.Data_Race %>% group_by(County) %>% filter(County == "Fresno") %>% ggplot(aes(Year, Total.Population, color = Mothers.Race)) + geom_line() + labs( y="Total Population (Log 10 Transformation)", title = "Racial Demographics in Fresno County (Transformed Y Axis)", subtitle = "2007-2019", color = "Mother's Race", caption = "Data Source: CDC WONDER Online Database") + scale_y_log10()

#Kern, Ranked #2 in Pesticide Use
MCH.CDC.Data_Race %>% group_by(County) %>% filter(County == "Kern") %>%  ggplot(aes(Year, Total.Population, color = Mothers.Race)) + geom_line() + labs( y="Total Population (Log 10 Transformation)", title = "Racial Demographics in Kern County (Transformed Y Axis)", subtitle = "2007-2019", color = "Mother's Race", caption = "Data Source: CDC WONDER Online Database") + scale_y_log10()

#Tulare, Ranked #3 in Pesticide Use
MCH.CDC.Data_Race %>% group_by(County) %>% filter(County == "Tulare") %>%  ggplot(aes(Year, Total.Population, color = Mothers.Race)) + geom_line() +  labs( y="Total Population (Log 10 Transformation)", title = "Racial Demographics in Tulare County (Transformed Y Axis)", subtitle = "2007-2019", color = "Mother's Race", caption = "Data Source: CDC WONDER Online Database") + scale_y_log10()

#San Joaquin, Ranked #4 in Pesticide Use
MCH.CDC.Data_Race %>% group_by(County) %>% filter(County == "San Joaquin") %>%  ggplot(aes(Year, Total.Population, color = Mothers.Race)) + geom_line()  + labs( y="Total Population (Log 10 Transformation)", title = "Racial Demographics in San Joaquin County (Transformed Y Axis)", subtitle = "2007-2019", color = "Mother's Race", caption = "Data Source: CDC WONDER Online Database") + scale_y_log10()

#Los Angeles, Comparison Group
MCH.CDC.Data_Race %>% group_by(County) %>% filter(County == "Los Angeles") %>%  ggplot(aes(Year, Total.Population, color = Mothers.Race)) + geom_line() + labs( y="Total Population (Log 10 Transformation)", title = "Racial Demographics in Los Angeles County (Transformed Y Axis)", subtitle = "2007-2019", color = "Mother's Race", caption = "Data Source: CDC WONDER Online Database") + scale_y_log10()

##### Fertility Rate (Stratified By Race) The next variable I examined was once again fertility rate but this time stratified by race. I first visualized the fertility rates across all counties over the years by each race in a tile plot. I then viewed the trends of fertility rates by race in our counties of interest which include Fresno, Kern, Tulare, San Joaquin (since they are continuously ranked as the top 4 in highest use of pesticides), and Los Angeles (as a control/comparison county) since LA is one of the most populated and most diverse counties in California. Fresno, Kern, Tulare, and San Joaquin counties are also all a part of San Joaquin Valley which we mentioned in our background and related work section to be of special interest to us because it is California’s most productive agricultural region and has one of the highest amounts of pesticide use.

#Race and Fertility GGPLOTS
MCH.CDC.Data_Race %>% filter(Mothers.Race == "American Indian or Alaska Native") %>%
ggplot(aes(x = Year, y = County,  fill = Fertility.Rate)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0,0)) +
    scale_fill_gradientn("Fertility Rate", limits = c(0,115),
                         colors = brewer.pal(9, "Reds")) +
    theme_minimal() +  
    theme(panel.grid = element_blank()) +
    ggtitle("Fertility Rate by County for American Indian or Alaska Native Pop.") + 
    ylab("") + xlab("") 

MCH.CDC.Data_Race %>% filter(Mothers.Race == "Asian or Pacific Islander") %>%
ggplot(aes(x = Year, y = County,  fill = Fertility.Rate)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0,0)) +
    scale_fill_gradientn("Fertility Rate",limits = c(0,115),
                         colors = brewer.pal(9, "Reds")) +
    theme_minimal() +  
    theme(panel.grid = element_blank()) +
    ggtitle("Fertility Rate by County for Asian or Pacific Islander Pop.") + 
    ylab("") + xlab("") 

MCH.CDC.Data_Race %>% filter(Mothers.Race == "Black or African American") %>%
ggplot(aes(x = Year, y = County,  fill = Fertility.Rate)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0,0)) +
    scale_fill_gradientn("Fertility Rate", limits = c(0,115),
                         colors = brewer.pal(9, "Reds")) +
    theme_minimal() +  
    theme(panel.grid = element_blank()) +
    ggtitle("Fertility Rate by County for Black or African American Pop.") + 
    ylab("") + xlab("") 

MCH.CDC.Data_Race %>% filter(Mothers.Race == "White") %>%
ggplot(aes(x = Year, y = County,  fill = Fertility.Rate)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0,0)) +
    scale_fill_gradientn("Fertility Rate", limits = c(0,115),
                         colors = brewer.pal(9, "Reds")) +
    theme_minimal() +  
    theme(panel.grid = element_blank()) +
    ggtitle("Fertility Rate by County for White Pop.") + 
    ylab("") + xlab("") 

#Fresno, Ranked #1 in Pesticide Use
MCH.CDC.Data_Race %>% filter(County == "Fresno") %>% ggplot(aes(Year, Fertility.Rate, color = Mothers.Race)) + geom_line() + labs( y="Fertility Rate", title = "Race and Fertility Data in Fresno County", subtitle = "2007-2019", color = "Mother's Race", caption = "Data Source: CDC WONDER Online Database") + ylim(0,115)

#Kern, Ranked #2 in Pesticide Use
MCH.CDC.Data_Race %>% filter(County == "Kern") %>% ggplot(aes(Year, Fertility.Rate, color = Mothers.Race)) + geom_line() + labs( y="Fertility Rate", title = "Race and Fertility Data in Kern County", subtitle = "2007-2019", color = "Mother's Race", caption = "Data Source: CDC WONDER Online Database") + ylim(0,115)

#Tulare, Ranked #3 in Pesticide Use
MCH.CDC.Data_Race %>% filter(County == "Tulare") %>% ggplot(aes(Year, Fertility.Rate, color = Mothers.Race)) + geom_line() + labs( y="Fertility Rate", title = "Race and Fertility Data in Tulare County", subtitle = "2007-2019", color = "Mother's Race", caption = "Data Source: CDC WONDER Online Database") + ylim(0,115)

#San Joaquin, Ranked #4 in Pesticide Use
MCH.CDC.Data_Race %>% filter(County == "San Joaquin") %>% ggplot(aes(Year, Fertility.Rate, color = Mothers.Race)) + geom_line() + labs( y="Fertility Rate", title = "Race and Fertility Data in San Joaquin County", subtitle = "2007-2019", color = "Mother's Race", caption = "Data Source: CDC WONDER Online Database") + ylim(0,115)

#Los Angeles, Comparison 
MCH.CDC.Data_Race %>% filter(County == "Los Angeles") %>% ggplot(aes(Year, Fertility.Rate, color = Mothers.Race)) + geom_line() + labs( y="Fertility Rate", title = "Race and Fertility Data in Los Angeles County", subtitle = "2007-2019", color = "Mother's Race", caption = "Data Source: CDC WONDER Online Database") + ylim(0,115)

Preterm Birth (Stratified By Race)

The next variable I examined was preterm birth measured as LMP (last menstrual period) gestation age in weeks, and I added stratification by race. I first visualized the preterm birth across all counties over the years and then stratified the tile plots by race. I then viewed the trends of gestational age by race in our counties of interest which include Fresno, Kern, Tulare, San Joaquin (since they are continuously ranked as the top 4 in highest use of pesticides), and Los Angeles (as a control/comparison county) since LA is one of the most populated and most diverse counties in California. Fresno, Kern, Tulare, and San Joaquin counties are also all a part of San Joaquin Valley which we mentioned in our background and related work section to be of special interest to us because it is California’s most productive agricultural region and has one of the highest amounts of pesticide use. I also included a horizontal line for the county level stratified data at 37 weeks which is the cutoff for defining preterm birth.

#Preterm Birth GGPLOTS (With Race)
MCH.CDC.Data %>% 
    ggplot(aes(x = Year, y = County,  fill = Average.LMP.Gestational.Age)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0,0)) +
    scale_fill_gradientn("Average Gestational Age (LMP) in Weeks", limits = c(36,40),
                         colors = brewer.pal(9, "Reds")) +
    theme_minimal() +  
    theme(panel.grid = element_blank()) +
    ggtitle("Average Gestational Age by County") + 
    ylab("") + xlab("") 

MCH.CDC.Data_Race %>% filter(Mothers.Race == "American Indian or Alaska Native") %>%
ggplot(aes(x = Year, y = County,  fill = Average.LMP.Gestational.Age)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0,0)) +
    scale_fill_gradientn("Average Gestational Age (LMP) in Weeks", limits = c(36,40),
                         colors = brewer.pal(9, "Reds")) +
    theme_minimal() +  
    theme(panel.grid = element_blank()) +
    ggtitle("Average Gestational Age by County for American Indian or Alaska Native Pop.") + 
    ylab("") + xlab("") 

MCH.CDC.Data_Race %>% filter(Mothers.Race == "Asian or Pacific Islander") %>%
ggplot(aes(x = Year, y = County,  fill = Average.LMP.Gestational.Age)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0,0)) +
    scale_fill_gradientn("Average Gestational Age (LMP) in Weeks", limits = c(36,40),
                         colors = brewer.pal(9, "Reds")) +
    theme_minimal() +  
    theme(panel.grid = element_blank()) +
    ggtitle("Average Gestational Age by County for Asian or Pacific Islander Pop.") + 
    ylab("") + xlab("") 

MCH.CDC.Data_Race %>% filter(Mothers.Race == "Black or African American") %>%
ggplot(aes(x = Year, y = County,  fill = Average.LMP.Gestational.Age)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0,0)) +
    scale_fill_gradientn("Average Gestational Age (LMP) in Weeks", limits = c(36,40),
                         colors = brewer.pal(9, "Reds")) +
    theme_minimal() +  
    theme(panel.grid = element_blank()) +
    ggtitle("Average Gestational Age by County for Black or African American Pop.") + 
    ylab("") + xlab("") 

MCH.CDC.Data_Race %>% filter(Mothers.Race == "White") %>%
ggplot(aes(x = Year, y = County,  fill = Average.LMP.Gestational.Age)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0,0)) +
    scale_fill_gradientn("Average Gestational Age (LMP) in Weeks", limits = c(36,40),
                         colors = brewer.pal(9, "Reds")) +
    theme_minimal() +  
    theme(panel.grid = element_blank()) +
    ggtitle("Average Gestational Age by County for White Pop.") + 
    ylab("") + xlab("") 

#Fresno, Ranked #1 in Pesticide Use
MCH.CDC.Data_Race %>% filter(County == "Fresno") %>% ggplot(aes(Year, Average.LMP.Gestational.Age, color = Mothers.Race)) + geom_line() + labs( y="Average LMP Gestational Age (Weeks)", title = "Preterm Birth in Fresno County", subtitle = "2007-2019", color = "Mother's Race", caption = "Data Source: CDC WONDER Online Database") + geom_hline(yintercept = 37, size =2) + ylim(36,40)

#Kern, Ranked #2 in Pesticide Use
MCH.CDC.Data_Race %>% filter(County == "Kern") %>% ggplot(aes(Year, Average.LMP.Gestational.Age, color = Mothers.Race)) + geom_line() + labs( y="Average LMP Gestational Age (Weeks)", title = "Preterm Birth in Kern County", subtitle = "2007-2019", color = "Mother's Race", caption = "Data Source: CDC WONDER Online Database")  + geom_hline(yintercept = 37, size =2) + ylim(36,40)

#Tulare, Ranked #3 in Pesticide Use
MCH.CDC.Data_Race %>% filter(County == "Tulare") %>% ggplot(aes(Year, Average.LMP.Gestational.Age, color = Mothers.Race)) + geom_line() + labs( y="Average LMP Gestational Age (Weeks)", title = "Preterm Birth in Tulare", subtitle = "2007-2019", color = "Mother's Race", caption = "Data Source: CDC WONDER Online Database")  + geom_hline(yintercept = 37, size =2) + ylim(36,40)

#San Joaquin, Ranked #4 in Pesticide Use
MCH.CDC.Data_Race %>% filter(County == "San Joaquin") %>% ggplot(aes(Year, Average.LMP.Gestational.Age, color = Mothers.Race)) + geom_line() + labs( y="Average LMP Gestational Age (Weeks)", title = "Preterm Birth in San Joaquin County", subtitle = "2007-2019", color = "Mother's Race", caption = "Data Source: CDC WONDER Online Database") + geom_hline(yintercept = 37, size =2) + ylim(36,40)

#Los Angeles County, Comparison Group
MCH.CDC.Data_Race %>% filter(County == "Los Angeles") %>% ggplot(aes(Year, Average.LMP.Gestational.Age, color = Mothers.Race)) + geom_line() + labs( y="Average LMP Gestational Age (Weeks)", title = "Preterm Birth in Los Angeles County", subtitle = "2007-2019", color = "Mother's Race", caption = "Data Source: CDC WONDER Online Database")  + geom_hline(yintercept = 37, size =2) + ylim(36,40)

#Preterm Birth Cutoff is 37 Weeks (Horizontal Line)
Birth Weight (Stratified By Race)

The next variable I examined was birth weight measured in grams, and I added stratification by race. I first visualized the birth weight across all counties over the years and then stratified the tile plots by race. I then viewed the trends of birth weight by race in our counties of interest which include Fresno, Kern, Tulare, San Joaquin (since they are continuously ranked as the top 4 in highest use of pesticides), and Los Angeles (as a control/comparison county) since LA is one of the most populated and most diverse counties in California. Fresno, Kern, Tulare, and San Joaquin counties are also all a part of San Joaquin Valley which we mentioned in our background and related work section to be of special interest to us because it is California’s most productive agricultural region and has one of the highest amounts of pesticide use. I also included a horizontal line for the county level stratified data at 25000 grams which is the cutoff for defining low birth weight.

#Birth weight GGPLOTS (With Race)
MCH.CDC.Data %>% 
    ggplot(aes(x = Year, y = County,  fill = Average.Birth.Weight)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0,0)) +
    scale_fill_gradientn("Average Birth Weight in Grams", limits = c(2400, 3600),
                         colors = brewer.pal(9, "Reds")) +
    theme_minimal() +  
    theme(panel.grid = element_blank()) +
    ggtitle("Average Birth Weight by County") + 
    ylab("") + xlab("") 

MCH.CDC.Data_Race %>% filter(Mothers.Race == "American Indian or Alaska Native") %>%
ggplot(aes(x = Year, y = County,  fill = Average.Birth.Weight)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0,0)) +
    scale_fill_gradientn("Average Birth Weight in Grams", limits = c(2400, 3600),
                         colors = brewer.pal(9, "Reds")) +
    theme_minimal() +  
    theme(panel.grid = element_blank()) +
    ggtitle("Average Birth Weight by County for American Indian or Alaska Native Pop.") + 
    ylab("") + xlab("") 

MCH.CDC.Data_Race %>% filter(Mothers.Race == "Asian or Pacific Islander") %>%
ggplot(aes(x = Year, y = County,  fill = Average.Birth.Weight)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0,0)) +
    scale_fill_gradientn("Average Birth Weight in Grams", limits = c(2400, 3600),
                         colors = brewer.pal(9, "Reds")) +
    theme_minimal() +  
    theme(panel.grid = element_blank()) +
    ggtitle("Average Birth Weight by County for Asian or Pacific Islander Pop.") + 
    ylab("") + xlab("") 

MCH.CDC.Data_Race %>% filter(Mothers.Race == "Black or African American") %>%
ggplot(aes(x = Year, y = County,  fill = Average.Birth.Weight)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0,0)) +
    scale_fill_gradientn("Average Birth Weight Grams", limits = c(2400, 3600),
                         colors = brewer.pal(9, "Reds")) +
    theme_minimal() +  
    theme(panel.grid = element_blank()) +
    ggtitle("Average Birth Weight by County for Black or African American Pop.") + 
    ylab("") + xlab("") 

MCH.CDC.Data_Race %>% filter(Mothers.Race == "White") %>%
ggplot(aes(x = Year, y = County,  fill = Average.Birth.Weight)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0,0)) +
    scale_fill_gradientn("Average Birth Weight in Grams", limits = c(2400, 3600),
                         colors = brewer.pal(9, "Reds")) +
    theme_minimal() +  
    theme(panel.grid = element_blank()) +
    ggtitle("Average Birth Weight by County for White Pop.") + 
    ylab("") + xlab("") 

#Fresno, Ranked #1 in Pesticide Use
MCH.CDC.Data_Race %>% filter(County == "Fresno") %>% ggplot(aes(Year, Average.Birth.Weight, color = Mothers.Race)) + geom_line() + labs( y="Average Birth Weight (Grams)", title = "Birth Weight Data in Fresno County", subtitle = "2007-2019", color = "Mother's Race", caption = "Data Source: CDC WONDER Online Database") + geom_hline(yintercept = 2500, size =2)+ ylim(2400, 3600)

#Kern, Ranked #2 in Pesticide Use
MCH.CDC.Data_Race %>% filter(County == "Kern") %>% ggplot(aes(Year, Average.Birth.Weight, color = Mothers.Race)) + geom_line() + labs( y="Average Birth Weight (Grams)", title = "Birth Weight Data in Kern County", subtitle = "2007-2019", color = "Mother's Race", caption = "Data Source: CDC WONDER Online Database")  + geom_hline(yintercept = 2500, size =2)+ ylim(2400, 3600)

#Tulare, Ranked #3 in Pesticide Use
MCH.CDC.Data_Race %>% filter(County == "Tulare") %>% ggplot(aes(Year, Average.Birth.Weight, color = Mothers.Race)) + geom_line() + labs( y="Average Birth Weight (Grams)", title = "Birth Weight Data in Tulare", subtitle = "2007-2019", color = "Mother's Race", caption = "Data Source: CDC WONDER Online Database")  + geom_hline(yintercept = 2500, size =2)+ ylim(2400, 3600)

#San Joaquin, Ranked #4 in Pesticide Use
MCH.CDC.Data_Race %>% filter(County == "San Joaquin") %>% ggplot(aes(Year, Average.Birth.Weight, color = Mothers.Race)) + geom_line() + labs( y="Average Birth Weight (Grams)", title = "Birth Weight Data in San Joaquin County", subtitle = "2007-2019", color = "Mother's Race", caption = "Data Source: CDC WONDER Online Database") + geom_hline(yintercept = 2500, size =2)+ ylim(2400, 3600)

#Los Angeles, Comparison Group
MCH.CDC.Data_Race %>% filter(County == "Los Angeles") %>% ggplot(aes(Year, Average.Birth.Weight, color = Mothers.Race)) + geom_line() + labs( y="Average Birth Weight (Grams)", title = "Birth Weight Data in Los Angeles County", subtitle = "2007-2019", color = "Mother's Race", caption = "Data Source: CDC WONDER Online Database")  + geom_hline(yintercept = 2500, size =2) + ylim(2400, 3600)

#LBW Cutoff is 2500 Grams (Horizontal Line)

Pesticide Data Wrangling

The CDPR creates a Summary of Pesticide Use Data every year that details pesticide usage throughout the state. Data from the 2016 report onward were available directly from a database linked in the report. For data prior to 2016, the one would need to navigate multiple large zip files before being able to find what they need. After many fruitless attempts to navigate the file system linked in the documents, I decided to get the desired table (Table 1, which shows the total pounds of pesticide active ingredients reported in each county and that county’s rank in the current year and the previous year) directly from the documents using the pdftools package and string processing. I read in the pages from each pdf and wrote the tables to csv file (“table1_YEAR.csv”) using the code in import_data2.R. These csv files were then combined to create a single table for pesticide usage from 2007 to 2016. I also imported a table of all the pesticides determined to cause reproductive harm according to Proposition 65. We were not able to find a use for that information.

county_ranks16 <- read_delim("table1_county_rank_2016.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
repro_lbs16 <- read_delim("table3_reproductive_lbs_2016.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
repro_acre16 <- read_delim("table4_reproductive_acres_2016.txt", "\t", escape_double = FALSE, trim_ws = TRUE)


table1_2016 <- county_ranks16 %>% transmute(county = COUNTY, 
                                     lbs_2015 = LBS_2015, rank_2015 = RANK_2015, 
                                     lbs_2016 = LBS_2016, rank_2016 = RANK_2016)
# column 1 is the county
# columns 2-3 have the previous year data
# columns 4-5 have the current year data

# we only want columns 1-3 for the most up-to-date data for all years before 2016
all_dat <- list(read_csv("table1_2007.csv")[1:3],
                read_csv("table1_2008.csv")[1:3],
                read_csv("table1_2009.csv")[1:3],
                read_csv("table1_2010.csv")[1:3],
                read_csv("table1_2011.csv")[1:3],
                read_csv("table1_2012.csv")[1:3],
                read_csv("table1_2013.csv")[1:3],
                read_csv("table1_2014.csv")[1:3],
                read_csv("table1_2015.csv")[1:3])


table1 <- Reduce(function(x, y) left_join(x, y, by = "county"), all_dat)


long_table1 <- table1 %>% pivot_longer(!county, names_to = 'usage', values_to = "value")
#table1_ranks <- long_table1 %>% filter(str_starts(usage, "rank"))
table1_lbs <- long_table1 %>% filter(str_starts(usage, "lbs"))

long_table2 <- table1_2016 %>% pivot_longer(!county, names_to = 'usage', values_to = "value")
#table1_ranks_1516 <- long_table2 %>% filter(str_starts(usage, "rank"))
table1_lbs_1516<- long_table2 %>% filter(str_starts(usage, "lbs"))

table1_lbs$usage <- as.numeric(gsub("[^[:digit:]]+", "", table1_lbs$usage))
table1_lbs_1516$usage <- as.numeric(gsub("[^[:digit:]]+", "", table1_lbs_1516$usage))
combined_pesticide_use <- table1_lbs %>% full_join(table1_lbs_1516) 
class(combined_pesticide_use$usage) 
## [1] "numeric"
combined_pesticide_use <- combined_pesticide_use %>% group_by(usage) 
combined_pesticide_use <- combined_pesticide_use %>% arrange(usage)

LBW Data from CDC (Wrangling for use in ShinyApp):

This is my data wrangling process for low birth weight for the CDC WONDER database. By default, CDC WONDER live birth database only displayed counties that had a county population >100,000. I only looked at low birth rate here and this is for my shiny app bar graph.

#MCH_CDC Data for low birth weight 
#data wrangling mch cdc data
cdc_lowbirthweight <- read.delim("MCH CDC Data.txt",  sep ="\t", dec=".", header = TRUE, stringsAsFactors = FALSE)
cdc_lowbirthweight  <- cdc_lowbirthweight [-c(482:538), ]
cdc_lowbirthweight  <- cdc_lowbirthweight [ ,-c(1, 3, 5, 7)]
MCH.CDC.Data.Total <- read.delim("MCH CDC Data Total.txt",  sep ="\t", dec=".", header = TRUE, stringsAsFactors = FALSE)
MCH.CDC.Data.Total <- MCH.CDC.Data.Total[,-c(1, 3, 5)]

#I noticed that the data I downloaded did not include total # of births so merging two datasets (one that has total # of birth counts and the other with low birth weight +very low birth weight counts)
df1 <- full_join(cdc_lowbirthweight , MCH.CDC.Data.Total, by=c("Year", "County"))
df1<- df1 %>% rename("cases" = "Births.x", "total_births" = "Births.y")

#Note: LBW = Low birth weight + Very low birth weight counts; Total Births = Total # of Birth
col_order <- c("Year", "County", "total_births",
               "cases", "Average.Birth.Weight", "Standard.Deviation.for.Average.Birth.Weight",
               "Average.Age.of.Mother", "Standard.Deviation.for.Average.Age.of.Mother","Average.LMP.Gestational.Age",
               "Standard.Deviation.for.Average.LMP.Gestational.Age")
df2 <- df1[,col_order]
df2[df2$County == "Alameda County, CA", "County"] <-"alameda"
df2[df2$County == "Butte County, CA", "County"] <-"butte"
df2[df2$County == "Contra Costa County, CA", "County"] <-"contra costa"
df2[df2$County == "El Dorado County, CA", "County"] <-"el dorado"
df2[df2$County == "Fresno County, CA", "County"] <-"fresno"
df2[df2$County == "Humboldt County, CA", "County"] <-"humboldt"
df2[df2$County == "Imperial County, CA", "County"] <-"imperial"
df2[df2$County == "Kern County, CA", "County"] <-"kern"
df2[df2$County == "Kings County, CA", "County"] <-"kings"
df2[df2$County == "Los Angeles County, CA", "County"] <-"los angeles"
df2[df2$County == "Madera County, CA", "County"] <-"madera"
df2[df2$County == "Marin County, CA", "County"] <-"marin"
df2[df2$County == "Contra Costa County, CA", "County"] <-"mariposa"
df2[df2$County == "Merced County, CA", "County"] <-"merced"
df2[df2$County == "Monterey County, CA", "County"] <-"monterey"
df2[df2$County == "Napa County, CA", "County"] <-"napa"
df2[df2$County == "Orange County, CA", "County"] <-"orange"
df2[df2$County == "Placer County, CA", "County"] <-"placer"
df2[df2$County == "Riverside County, CA", "County"] <-"riverside"
df2[df2$County == "Sacramento County, CA", "County"] <-"sacramento"
df2[df2$County == "San Bernardino County, CA", "County"] <-"san bernardino"
df2[df2$County == "San Diego County, CA", "County"] <-"san diego"
df2[df2$County == "San Francisco County, CA", "County"] <-"san francisco"
df2[df2$County == "San Joaquin County, CA", "County"] <-"san joaquin"
df2[df2$County == "San Luis Obispo County, CA", "County"] <-"san luis obispo"
df2[df2$County == "San Mateo County, CA", "County"] <-"san mateo"
df2[df2$County == "Santa Barbara County, CA", "County"] <-"santa barbara"
df2[df2$County == "Santa Clara County, CA", "County"] <-"santa clara"
df2[df2$County == "Santa Cruz County, CA", "County"] <-"santa cruz"
df2[df2$County == "Shasta County, CA", "County"] <-"shasta"
df2[df2$County == "Solano County, CA", "County"] <-"solano"
df2[df2$County == "Sonoma County, CA", "County"] <-"sonoma"
df2[df2$County == "Stanislaus County, CA", "County"] <-"stanislaus"
df2[df2$County == "Tulare County, CA", "County"] <-"tulare"
df2[df2$County == "Ventura County, CA", "County"] <-"ventura"
df2[df2$County == "Yolo County, CA", "County"] <-"yolo"
df2 <- df2 %>% filter(!is.na(total_births)) %>% filter(!is.na(cases)) %>% mutate(rate = cases/total_births * 10^2)
df2$County <- df2$County %>% str_to_title()
Preterm Birth Data from CDC Wonder Database (Wrangling for use in ShinyApp)::

This is my data wrangling process for preterm birth for the CDC WONDER database. By default, CDC WONDER live birth database only displayed counties that had a county population >100,000. I only looked at preterm birth here and this is for my shiny app bar graph.

cdc_pretermbirth <- read.delim("Preterm birth.txt",  sep ="\t", dec=".", header = TRUE, stringsAsFactors = FALSE)
cdc_pretermbirth  <- cdc_pretermbirth [-c(422:472), ]
cdc_pretermbirth  <- cdc_pretermbirth [ ,-c(1, 3, 5)]
cdc_pretermbirth <- cdc_pretermbirth %>% rename("Events" = "Births")
MCH.CDC.Data.Total <- read.delim("MCH CDC Data Total.txt",  sep ="\t", dec=".", header = TRUE, stringsAsFactors = FALSE)
MCH.CDC.Data.Total <- MCH.CDC.Data.Total[,-c(1, 3, 5)]
MCH.CDC.Data.Total <- MCH.CDC.Data.Total %>% rename("total_birth" = "Births")

df1_pt <- full_join(cdc_pretermbirth , MCH.CDC.Data.Total, by=c("Year", "County"))

df1_pt[df1_pt$County == "Alameda County, CA", "County"] <-"alameda"
df1_pt[df1_pt$County == "Butte County, CA", "County"] <-"butte"
df1_pt[df1_pt$County == "Contra Costa County, CA", "County"] <-"contra costa"
df1_pt[df1_pt$County == "El Dorado County, CA", "County"] <-"el dorado"
df1_pt[df1_pt$County == "Fresno County, CA", "County"] <-"fresno"
df1_pt[df1_pt$County == "Humboldt County, CA", "County"] <-"humboldt"
df1_pt[df1_pt$County == "Imperial County, CA", "County"] <-"imperial"
df1_pt[df1_pt$County == "Kern County, CA", "County"] <-"kern"
df1_pt[df1_pt$County == "Kings County, CA", "County"] <-"kings"
df1_pt[df1_pt$County == "Los Angeles County, CA", "County"] <-"los angeles"
df1_pt[df1_pt$County == "Madera County, CA", "County"] <-"madera"
df1_pt[df1_pt$County == "Marin County, CA", "County"] <-"marin"
df1_pt[df1_pt$County == "Contra Costa County, CA", "County"] <-"mariposa"
df1_pt[df1_pt$County == "Merced County, CA", "County"] <-"merced"
df1_pt[df1_pt$County == "Monterey County, CA", "County"] <-"monterey"
df1_pt[df1_pt$County == "Napa County, CA", "County"] <-"napa"
df1_pt[df1_pt$County == "Orange County, CA", "County"] <-"orange"
df1_pt[df1_pt$County == "Placer County, CA", "County"] <-"placer"
df1_pt[df1_pt$County == "Riverside County, CA", "County"] <-"riverside"
df1_pt[df1_pt$County == "Sacramento County, CA", "County"] <-"sacramento"
df1_pt[df1_pt$County == "San Bernardino County, CA", "County"] <-"san bernardino"
df1_pt[df1_pt$County == "San Diego County, CA", "County"] <-"san diego"
df1_pt[df1_pt$County == "San Francisco County, CA", "County"] <-"san francisco"
df1_pt[df1_pt$County == "San Joaquin County, CA", "County"] <-"san joaquin"
df1_pt[df1_pt$County == "San Luis Obispo County, CA", "County"] <-"san luis obispo"
df1_pt[df1_pt$County == "San Mateo County, CA", "County"] <-"san mateo"
df1_pt[df1_pt$County == "Santa Barbara County, CA", "County"] <-"santa barbara"
df1_pt[df1_pt$County == "Santa Clara County, CA", "County"] <-"santa clara"
df1_pt[df1_pt$County == "Santa Cruz County, CA", "County"] <-"santa cruz"
df1_pt[df1_pt$County == "Shasta County, CA", "County"] <-"shasta"
df1_pt[df1_pt$County == "Solano County, CA", "County"] <-"solano"
df1_pt[df1_pt$County == "Sonoma County, CA", "County"] <-"sonoma"
df1_pt[df1_pt$County == "Stanislaus County, CA", "County"] <-"stanislaus"
df1_pt[df1_pt$County == "Tulare County, CA", "County"] <-"tulare"
df1_pt[df1_pt$County == "Ventura County, CA", "County"] <-"ventura"
df1_pt[df1_pt$County == "Yolo County, CA", "County"] <-"yolo"
df1_pt <- df1_pt %>% mutate(County = str_to_title(County))
df1_pt <- df1_pt %>% filter(!is.na("total_birth")) %>% filter(!is.na(Events)) %>% mutate(rate = Events/total_birth * 10^2)

Data Joining of CDC WONDER Live Birth Data (Low Birth Weight and Pre-Term Birth) and Pesticide Data

I then joined the CDC WONDER data (low birth weight and preterm birth) and Zainab’s wrangled pesticide data to come up with a joint data. I then generated bar graphs to visualize the trend across a span of 2007-2016 (Please see shiny app). We noticed that Fresno and Kern county were the two top counties that used the highest amounts of pesticide and found out that San Joaquin Valley is a region that’s agriculturally productive.

county_ranks16 <- read_delim("table1_county_rank_2016.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
repro_lbs16 <- read_delim("table3_reproductive_lbs_2016.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
repro_acre16 <- read_delim("table4_reproductive_acres_2016.txt", "\t", escape_double = FALSE, trim_ws = TRUE)

table1_2016 <- county_ranks16 %>% transmute(county = COUNTY, 
                                            lbs_2015 = LBS_2015, rank_2015 = RANK_2015, 
                                            lbs_2016 = LBS_2016, rank_2016 = RANK_2016)

all_dat <- list(read_csv("table1_2007.csv")[1:3],
                read_csv("table1_2008.csv")[1:3],
                read_csv("table1_2009.csv")[1:3],
                read_csv("table1_2010.csv")[1:3],
                read_csv("table1_2011.csv")[1:3],
                read_csv("table1_2012.csv")[1:3],
                read_csv("table1_2013.csv")[1:3],
                read_csv("table1_2014.csv")[1:3],
                read_csv("table1_2015.csv")[1:3])


table1 <- Reduce(function(x, y) left_join(x, y, by = "county"), all_dat)

long_table1 <- table1 %>% pivot_longer(!county, names_to = 'usage', values_to = "value")
table1_ranks <- long_table1 %>% filter(str_starts(usage, "rank"))
table1_lbs <- long_table1 %>% filter(str_starts(usage, "lbs"))
table1_lbs$usage <- as.numeric(gsub("[^[:digit:]]+", "", table1_lbs$usage))

long_table2 <- table1_2016 %>% pivot_longer(!county, names_to = 'usage', values_to = "value")
table1_ranks_1516 <- long_table2 %>% filter(str_starts(usage, "rank"))
table1_lbs_1516<- long_table2 %>% filter(str_starts(usage, "lbs"))


table1_lbs_1516$usage <- as.numeric(gsub("[^[:digit:]]+", "", table1_lbs_1516$usage))
combined_pesticide_use <- table1_lbs %>% full_join(table1_lbs_1516) 
combined_pesticide_use <- combined_pesticide_use %>% group_by(usage) 
combined_pesticide_use <- combined_pesticide_use %>% arrange(usage)

averagebw <-df2 %>% select("County", "Year", "rate")
pesticide_averagebw_join <- averagebw %>% inner_join(combined_pesticide_use, by = c("County" = "county", "Year" = "usage")) 

averagept <-df1_pt %>% select("County", "Year", "rate")
pesticide_averagept_join <- averagept %>% inner_join(combined_pesticide_use, by = c("County" = "county", "Year" = "usage")) 

#bar graph of low birth weight
pesticide_averagebw_join %>% ggplot(aes(County, rate)) + geom_col() + ylab("Low Birth Weight Rate (%)") +xlab("") +
                theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 1/2)) 

#bar graph of pesticide 
pesticide_averagept_join %>% ggplot(aes(County, value)) + geom_col() + ylab("Pesticide Use (Pounds)") +xlab("") + 
            theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 1/2))

Code for Creating the Map - Leaflet Map Using Preterm Birth Data I Wrangled Earlier (See Shiny App for the Final Result)

This is a similar spatial map but for pesticide use. Following is the wrangling process of generating a map that shapes the map of California, and merging that spatial object with pesticide use that I wrangled earlier to generate a leaflet map.

averagept_df <- combined_pesticide_use %>% filter(usage == "2016")

map <- readOGR(path.expand("cb_2018_us_county_20m.shp"),
               layer = "cb_2018_us_county_20m", stringsAsFactors = FALSE)
## OGR data source with driver: ESRI Shapefile 
## Source: "C:\Users\zay-z\Documents\Harvard Chan\Fall 2020\BST260\datascience-project\Data Prep (& Final RMD)\cb_2018_us_county_20m.shp", layer: "cb_2018_us_county_20m"
## with 3220 features
## It has 9 fields
## Integer64 fields read as strings:  ALAND AWATER
Statekey<-read.csv('./STATEFPtoSTATENAME_Key.csv', colClasses=c('character'))
map<-merge(x=map, y=Statekey, by="STATEFP", all=TRUE)
SingleState <- subset(map, map$STATENAME %in% c(
    "California"
))

spatial_pesticide <-sp::merge(x=SingleState, y=averagept_df, by.x="NAME", by.y="county", by=x)

binpes <- c(200, 100145, 1131454, 3345277, Inf)
pal3 <- colorBin(
    palette = "magma",
    domain = spatial_pesticide$value, n=7, bins=binpes)

leaflet(spatial_pesticide, options = leafletOptions(zoomControl = TRUE, zoomLevelFixed = FALSE, dragging=TRUE, minZoom = 5.3, maxZoom = 9)) %>% 
                setView(lat = 36.778259, lng = -119.417931, zoom = 6) %>%
                addPolygons(color = "Black", weight = 1, smoothFactor = 0.5, 
                            opacity = 1.0, fillOpacity = 0.5, layerId = ~NAME,
                            fillColor = ~pal3(value), 
                            popup = ~as.factor(paste0("<b><font size=\"4\"><center>County: </b>",spatial_pesticide$NAME,"</font></center>","Amounts of Pesticides used </b>", sprintf("%1.2f", spatial_pesticide$value),"<br/>"))) %>%
                addLegend(pal = pal3, values = spatial_pesticide$value, opacity = 1, title="Amounts of Pesticide Used (Pounds)")

Regression Analysis

#Top 10 Counties in term of pesticide usage, according to CFDA
agro <- c("Kern", "Tulare", "Fresno", "Monterey", "Merced", "Stanislaus", 
          "San Joaquin", "Ventura", "Imperial", "Kings")

mch_regression <- MCH.CDC.Data_Race %>% 
  filter(Year == 2016) %>%
  mutate(agricultural = ifelse(County %in% agro, 1, 0))

First we fit a model, stratified by agricultural ranking.

#best model in terms of p-values and adjusted R-squared, contains and interaction term between gestational age and top ten status
linmod <- lm(Average.Birth.Weight ~ Average.LMP.Gestational.Age + factor(agricultural) + Average.LMP.Gestational.Age * factor(agricultural), mch_regression)
summary(linmod)[4]
## $coefficients
##                                                     Estimate Std. Error
## (Intercept)                                       -6880.0613  995.26975
## Average.LMP.Gestational.Age                         262.4304   25.70411
## factor(agricultural)1                              5256.6615 1888.23917
## Average.LMP.Gestational.Age:factor(agricultural)1  -136.4639   48.80666
##                                                     t value     Pr(>|t|)
## (Intercept)                                       -6.912760 1.795659e-10
## Average.LMP.Gestational.Age                       10.209666 1.996331e-18
## factor(agricultural)1                              2.783896 6.154339e-03
## Average.LMP.Gestational.Age:factor(agricultural)1 -2.796011 5.940801e-03
summary(linmod)[9]
## $adj.r.squared
## [1] 0.4615328
#On average, babies born to mothers living in a top ten agricultural county have an average birth weight less than those born to mothers that do not live in a top ten agricultural county, for a given LMP gestational age. As average LMP gestational age increases, average birth weight also increases.
mch_regression %>%
  ggplot(aes(Average.LMP.Gestational.Age, Average.Birth.Weight, color = factor(agricultural))) + 
  geom_point() + 
  geom_line(aes(y = predict(linmod)))  + 
  xlab("Average LMP Gestational Age (weeks)") + 
  ylab("Average Birth Weight (grams)") + 
  scale_color_discrete(name = "Top Ten\nAgricultural\nCounty", labels = c('No', "Yes")) +
  ggtitle("Birth Weight Outcomes by Agriculture Category") + 
  ylim(2980, 3520) +
  xlim(37.6, 39.6)

Next we fit a models stratified by mother’s race.

#Black and Asian/Pacific Island populations tend to have lower average birth weights
mch_regression %>%
  ggplot(aes(Average.LMP.Gestational.Age, Average.Birth.Weight, color = Mothers.Race)) + 
  geom_point() +  
  xlab("Average LMP Gestational Age (weeks)") + 
  ylab("Average Birth Weight (grams)") + 
  scale_color_discrete(name = "Mother's Race") +
  ggtitle("Birth Weight Outcomes by Race") + 
  ylim(2980, 3520) +
  xlim(37.6, 39.6)

#simple linear model more parsimonious than the one that has the interaction term for American Indian/Alaska Native Mothers
#tho there are low populations for this group, so the numbers in the data may have a lot of variability among different years
amerindian_mod <- lm(Average.Birth.Weight ~ Average.LMP.Gestational.Age, filter(mch_regression, Mothers.Race == "American Indian or Alaska Native"))
summary(amerindian_mod)[4] #coefficients
## $coefficients
##                               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)                 -3829.7135 1246.73649 -3.071791 4.495686e-03
## Average.LMP.Gestational.Age   184.9774   32.20391  5.743941 2.857815e-06
summary(amerindian_mod)[9] #adjusted r-squared
## $adj.r.squared
## [1] 0.5078807
q1 <- mch_regression %>%
  filter(Mothers.Race == "American Indian or Alaska Native") %>%
  ggplot(aes(Average.LMP.Gestational.Age, Average.Birth.Weight, color = factor(agricultural))) + 
  geom_point() + geom_line(aes(y = predict(amerindian_mod)), size = 1) + 
  xlab("Average LMP Gestational Age (weeks)") + 
  ylab("Average Birth Weight (grams)") + 
  scale_color_discrete(name = "Top Ten\nAgricultural\nCounty", labels = c('No', "Yes")) +
  ggtitle("Birth Weight Outcome for American Indian and Alaska Native Mothers") + 
  ylim(2980, 3520) +
  xlim(37.6, 39.6)
#even simple linear regression doesn't explain a lot of the errors for Asian mothers
asian_mod <- lm(Average.Birth.Weight ~ Average.LMP.Gestational.Age, filter(mch_regression, Mothers.Race == "Asian or Pacific Islander"))
summary(asian_mod)[4] #coefficients
## $coefficients
##                               Estimate Std. Error    t value   Pr(>|t|)
## (Intercept)                 -429.60128 1658.17961 -0.2590801 0.79718280
## Average.LMP.Gestational.Age   94.23017   42.87789  2.1976402 0.03509986
summary(asian_mod)[9] #adjusted r-squared
## $adj.r.squared
## [1] 0.1012334
q2 <- mch_regression %>%
  filter(Mothers.Race == "Asian or Pacific Islander") %>%
  ggplot(aes(Average.LMP.Gestational.Age, Average.Birth.Weight, color = factor(agricultural))) + 
  geom_point() + 
  geom_line(aes(y = predict(asian_mod)), size = 1) + 
  xlab("Average LMP Gestational Age (weeks)") + 
  ylab("Average Birth Weight (grams)") + 
  scale_color_discrete(name = "Top Ten\nAgricultural\nCounty", labels = c('No', "Yes")) +
  ggtitle("Birth Weight Outcome for Asian and Pacific Islander Mothers") + 
  ylim(2980, 3520) +
  xlim(37.6, 39.6)
black_mod <- lm(Average.Birth.Weight ~ Average.LMP.Gestational.Age + factor(agricultural) + Average.LMP.Gestational.Age*factor(agricultural), filter(mch_regression, Mothers.Race == "Black or African American"))
summary(black_mod)[4] #coefficients
## $coefficients
##                                                     Estimate Std. Error
## (Intercept)                                       -5672.4322 1361.94810
## Average.LMP.Gestational.Age                         230.1197   35.26328
## factor(agricultural)1                              5451.8961 2461.08502
## Average.LMP.Gestational.Age:factor(agricultural)1  -142.4039   63.79834
##                                                     t value     Pr(>|t|)
## (Intercept)                                       -4.164940 2.304953e-04
## Average.LMP.Gestational.Age                        6.525760 2.774693e-07
## factor(agricultural)1                              2.215241 3.422210e-02
## Average.LMP.Gestational.Age:factor(agricultural)1 -2.232094 3.297249e-02
summary(black_mod)[9] #adjusted r-squared
## $adj.r.squared
## [1] 0.585938
q3 <- mch_regression %>%
  filter(Mothers.Race == "Black or African American") %>%
  ggplot(aes(Average.LMP.Gestational.Age, Average.Birth.Weight, color = factor(agricultural))) + 
  geom_point() + geom_line(aes(y = predict(black_mod)), size = 1) + 
  xlab("Average LMP Gestational Age (weeks)") + 
  ylab("Average Birth Weight (grams)") + 
scale_color_discrete(name = "Top Ten\nAgricultural\nCounty", labels = c('No', "Yes")) +
  ggtitle("Birth Weight Outcome for Black Mothers") +
  ylim(2980, 3520) +
  xlim(37.6, 39.6)
white_mod <- lm(Average.Birth.Weight ~ Average.LMP.Gestational.Age + factor(agricultural) + Average.LMP.Gestational.Age*factor(agricultural), filter(mch_regression, Mothers.Race == "White"))
summary(white_mod)[4] #coefficients
## $coefficients
##                                                     Estimate Std. Error
## (Intercept)                                       -5250.8344 1098.58067
## Average.LMP.Gestational.Age                         221.3943   28.26190
## factor(agricultural)1                              6590.0755 2589.80533
## Average.LMP.Gestational.Age:factor(agricultural)1  -170.1409   66.77334
##                                                     t value     Pr(>|t|)
## (Intercept)                                       -4.779653 4.035348e-05
## Average.LMP.Gestational.Age                        7.833668 7.687165e-09
## factor(agricultural)1                              2.544622 1.613772e-02
## Average.LMP.Gestational.Age:factor(agricultural)1 -2.548036 1.600829e-02
summary(white_mod)[9] #adjusted r-squared
## $adj.r.squared
## [1] 0.6877013
q4 <- mch_regression %>%
  filter(Mothers.Race == "White" ) %>%
  ggplot(aes(Average.LMP.Gestational.Age, Average.Birth.Weight, color = factor(agricultural))) + 
  geom_point() + geom_line(aes(y = predict(white_mod)), size = 1) + 
  xlab("Average LMP Gestational Age (weeks)") + 
  ylab("Average Birth Weight (grams)") + 
scale_color_discrete(name = "Top Ten\nAgricultural\nCounty", labels = c('No', "Yes")) +
  ggtitle("Birth Weight Outcome for White Mothers") +
  ylim(2980, 3520) +
  xlim(37.6, 39.6)
#Imperial appears to be an influential point in the plot for Asian, Black, and White mothers
q1 #very small population, but the line fits well

q2 #simplest linear model still not good :()

q3 #appears to be a disparity, but Imperial my also be affecting the model

q4 #least variation in birth weights and gestational ages

Checking Assumptions for Linear Models by Checking Residuals

Our data set is small, about 50 counties total, so it was not feasible to divide the data into training and testing sets. I am checking the residuals to make sure the linear models are appropriate.

#LINE assumptions met, Black mothers in Imperial (26) appear to be influential
plot(linmod)

# residuals definitely skewed, Imperial (7) and Humboldt (6) cause problems
#linear model not appropriate at all at all :(
plot(asian_mod)

#LINE assumptions satisfied
plot(amerindian_mod)

# Imperial (7) strikes again! besides that it's good
plot(black_mod)

# Imperial (7), other than that it's good
plot(white_mod)